policy graph


MDP modeling for multi-stage stochastic programs

Morton, David P., Dowson, Oscar, Pagnoncelli, Bernardo K.

arXiv.org Artificial Intelligence

We study a class of multi-stage stochastic programs, which incorporate modeling features from Markov decision processes (MDPs). This class includes structured MDPs with continuous state and action spaces. We extend policy graphs to include decision-dependent uncertainty for one-step transition probabilities as well as a limited form of statistical learning. We focus on the expressiveness of our modeling approach, illustrating ideas with a series of examples of increasing complexity. As a solution method, we develop new variants of stochastic dual dynamic programming, including approximations to handle non-convexities.
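A policy graph of this kind can be miniaturized in code. In the sketch below, the `Node` class, the toy inventory-style dynamics, and the two-stage example are all invented for illustration, not the paper's formulation: each node carries a stage cost and an action-dependent one-step transition distribution over child nodes, which is decision-dependent uncertainty in its simplest form.

```python
import random

class Node:
    def __init__(self, stage_cost, transition):
        self.stage_cost = stage_cost   # (state, action) -> cost
        self.transition = transition   # action -> {child name: probability}

def rollout(nodes, root, policy, state, rng):
    """Simulate one pass through the policy graph under a fixed policy."""
    total, current = 0.0, root
    while True:
        node = nodes[current]
        action = policy(current, state)
        total += node.stage_cost(state, action)
        state -= action                        # toy dynamics: consume stock
        probs = node.transition(action)        # may depend on the decision
        if not probs:                          # leaf node: rollout ends
            return total
        children, weights = zip(*probs.items())
        current = rng.choices(children, weights=weights)[0]

# Two-stage chain t1 -> t2, paying `action` units of cost at each stage.
nodes = {
    "t1": Node(lambda s, a: a, lambda a: {"t2": 1.0}),
    "t2": Node(lambda s, a: a, lambda a: {}),
}
cost = rollout(nodes, "t1", policy=lambda n, s: 2, state=5, rng=random.Random(0))
```

A solution method such as SDDP would replace the fixed `policy` with one derived from value-function cuts at each node; the rollout skeleton stays the same.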


IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems

Levi, Elad, Kadar, Ilan

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are transforming artificial intelligence, evolving into task-oriented systems capable of autonomous planning and execution. One of the primary applications of LLMs is conversational AI systems, which must navigate multi-turn dialogues, integrate domain-specific APIs, and adhere to strict policy constraints. However, evaluating these agents remains a significant challenge, as traditional methods fail to capture the complexity and variability of real-world interactions. We introduce IntellAgent, a scalable, open-source multi-agent framework designed to evaluate conversational AI systems comprehensively. IntellAgent automates the creation of diverse, synthetic benchmarks by combining policy-driven graph modeling, realistic event generation, and interactive user-agent simulations. This innovative approach provides fine-grained diagnostics, addressing the limitations of static and manually curated benchmarks with coarse-grained metrics. IntellAgent represents a paradigm shift in evaluating conversational AI. By simulating realistic, multi-policy scenarios across varying levels of complexity, IntellAgent captures the nuanced interplay of agent capabilities and policy constraints. Unlike traditional methods, it employs a graph-based policy model to represent relationships, likelihoods, and complexities of policy interactions, enabling highly detailed diagnostics. IntellAgent also identifies critical performance gaps, offering actionable insights for targeted optimization. Its modular, open-source design supports seamless integration of new domains, policies, and APIs, fostering reproducibility and community collaboration. Our findings demonstrate that IntellAgent serves as an effective framework for advancing conversational AI by addressing challenges in bridging research and deployment. The framework is available at https://github.com/plurai-ai/intellagent
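The graph-based policy model described above can be sketched, in a much-reduced form, as a weighted graph sampled by random walk so that strongly linked policies tend to be tested together. The policy names, edge weights, and the `sample_scenario` helper below are all invented for illustration; this is not the IntellAgent API.

```python
import random

POLICY_GRAPH = {
    "refund":       {"id_check": 0.8, "escalation": 0.2},
    "id_check":     {"refund": 0.5, "data_privacy": 0.5},
    "escalation":   {"data_privacy": 1.0},
    "data_privacy": {},
}

def sample_scenario(graph, start, max_policies, rng):
    """Random-walk the graph to pick a set of co-occurring policies."""
    scenario, current = [start], start
    while len(scenario) < max_policies and graph[current]:
        nxt = rng.choices(list(graph[current]),
                          weights=list(graph[current].values()))[0]
        if nxt not in scenario:
            scenario.append(nxt)
        current = nxt
    return scenario

scenario = sample_scenario(POLICY_GRAPH, "refund", 3, random.Random(0))
```

Each sampled policy set would then seed a synthetic dialog and an interactive user-agent simulation in the full framework.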


Intention-aware policy graphs: answering what, how, and why in opaque agents

Gimenez-Abalos, Victor, Alvarez-Napagao, Sergio, Tormos, Adrian, Cortés, Ulises, Vázquez-Salceda, Javier

arXiv.org Artificial Intelligence

However, precisely because of the definition of such a task, the result is an artefact that, unless explicitly designed to be transparent, is often not interpretable or, hence, trustworthy (Zhang et al., 2021; Lipton, 2017). This is where the field of Explainable Artificial Intelligence (XAI) shines. A model explanation is an exercise in communication between a sender or source (i.e. the model or one of its components) and a receiver (i.e. the explainee, a human or another processor for a downstream task) that describes the relevant context or the causes surrounding some facts (Lewis, 1986; Miller, 2019; Wright, 2004), which in the context of AI is often related to its final or intermediate outputs or decisions. Any such communicative act can be considered an explanation, but not all explanations may be useful or even desirable. According to empirical studies (Slugoski et al., 1993), it can be argued that the form of an explanation must depend on its function as an answer to a question within a conversational framework. Furthermore, in the words of Herbert Paul Grice (Grice, 1975), for a communicative act to be useful, four maxims should be followed:

1. Manner: the message or explanans should be comprehensible and clear to the receiver, which within the context of XAI is often referred to as interpretability (Lipton, 2017).
2. Quality: the message contains truthful information; in the context of XAI, reliability or explanation verification (Zhou et al., 2021b; Slack et al., 2021; Arias-Duart et al., 2022).
3. Quantity: the length of a message should be just enough to be informative, often a heuristic implicitly agreed upon in the design of explainable systems, which depends on both the sender and the code it uses.
4. Relation: the explanation should be relevant to the given context, significant when one can keep searching for causes of causes beyond the scope of relevance.


AnyTOD: A Programmable Task-Oriented Dialog System

Zhao, Jeffrey, Cao, Yuan, Gupta, Raghav, Lee, Harrison, Rastogi, Abhinav, Wang, Mingqiu, Soltau, Hagen, Shafran, Izhak, Wu, Yonghui

arXiv.org Artificial Intelligence

We propose AnyTOD, an end-to-end, zero-shot task-oriented dialog (TOD) system capable of handling unseen tasks without task-specific training. We view TOD as a program executed by a language model (LM), where program logic and ontology are provided by a designer as a schema. To enable generalization to unseen schemas and programs without prior training, AnyTOD adopts a neuro-symbolic approach. A neural LM keeps track of events occurring during a conversation, and a symbolic program implementing the dialog policy is executed to recommend the next actions AnyTOD should take. This approach drastically reduces data annotation and model training requirements, addressing the enduring challenge of rapidly adapting a TOD system to unseen tasks and domains. We demonstrate state-of-the-art results on the STAR, ABCD, and SGD benchmarks. We also demonstrate strong zero-shot transfer ability in low-resource settings, such as zero-shot transfer to MultiWOZ. In addition, we release STARv2, an updated version of the STAR dataset with richer annotations, for benchmarking zero-shot end-to-end TOD models.
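The neuro-symbolic split can be illustrated in miniature. In the toy below, the schema, slots, and events are invented: in AnyTOD the neural LM would emit structured events from the conversation, whereas here they are hand-written, and a plain Python function stands in for the symbolic dialog-policy program.

```python
def dialog_policy(events):
    """Recommend the next system action from the tracked events."""
    slots = {e["slot"]: e["value"] for e in events if e["type"] == "inform"}
    if "date" not in slots:
        return ("request", "date")
    if "party_size" not in slots:
        return ("request", "party_size")
    if not any(e["type"] == "confirm" for e in events):
        return ("confirm", slots)
    return ("book", slots)

events = [{"type": "inform", "slot": "date", "value": "Friday"}]
action = dialog_policy(events)   # -> ('request', 'party_size')
```

Because the policy is an ordinary program rather than learned weights, swapping in a new task means swapping in a new schema and policy function, with no retraining.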


General Policies, Serializations, and Planning Width

Bonet, Blai, Geffner, Hector

arXiv.org Artificial Intelligence

It has been observed that in many of the benchmark planning domains, atomic goals can be reached with a simple polynomial exploration procedure, called IW, that runs in time exponential in the problem width. Such problems indeed have bounded width: a width that does not grow with the number of problem variables and is often no greater than two. Yet, while the notion of width has become part of state-of-the-art planning algorithms like BFWS, there is still no good explanation for why so many benchmark domains have bounded width. In this work, we address this question by relating bounded width and serialized width to ideas of generalized planning, where general policies aim to solve multiple instances of a planning problem all at once. We show that bounded width is a property of planning domains that admit optimal general policies in terms of features that are explicitly or implicitly represented in the domain encoding. The results are extended to a much larger class of domains with bounded serialized width, where the general policies do not have to be optimal. The study also leads to a new, simple, meaningful, and expressive language for specifying domain serializations in the form of policy sketches, which can be used for encoding domain control knowledge by hand or for learning it from traces. The use of sketches and the meaning of the theoretical results are illustrated through a number of examples.
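IW(1), the width-1 member of the procedure mentioned above, is simple enough to sketch: a breadth-first search that prunes every state failing the novelty-1 test, i.e. a state is expanded only if it makes some atom true for the first time in the search. The toy domain below (a counter with inc/dec actions) is an invented illustration, not one of the paper's benchmarks.

```python
from collections import deque

def iw1(initial, goal_atom, successors, atoms):
    seen_atoms = set(atoms(initial))
    frontier = deque([(initial, [])])
    while frontier:
        state, plan = frontier.popleft()
        if goal_atom in atoms(state):
            return plan
        for action, nxt in successors(state):
            new = set(atoms(nxt)) - seen_atoms
            if not new:
                continue                  # pruned: nothing novel
            seen_atoms |= new
            frontier.append((nxt, plan + [action]))
    return None

atoms = lambda s: [("at", s)]                       # state -> its atoms
succ = lambda s: [("inc", s + 1), ("dec", s - 1)]   # labelled successors
plan = iw1(0, ("at", 3), succ, atoms)               # -> ['inc', 'inc', 'inc']
```

The search touches at most one state per atom, which is why the runtime is polynomial in the number of atoms rather than exponential in the number of states.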


Technical Report: The Policy Graph Improvement Algorithm

Pajarinen, Joni

arXiv.org Artificial Intelligence

Optimizing a partially observable Markov decision process (POMDP) policy is challenging. The policy graph improvement (PGI) algorithm for POMDPs represents the policy as a fixed-size policy graph and improves the policy monotonically. Due to the fixed policy size, the computation time for each improvement iteration is known in advance. Moreover, the method allows for compact, understandable policies. This report describes the technical details of the PGI [1] and particle-based PGI [2] algorithms for POMDPs in a more accessible way than [1] or [2], allowing practitioners and students to understand and implement the algorithms.
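The policy-graph representation, and the evaluation half of such an algorithm, can be sketched as follows (PGI's monotone improvement step is omitted, and the two-state toy POMDP below is invented): each graph node stores an action and, per observation, a successor node, and the value of every (node, state) pair follows from repeated Bellman backups over the fixed graph.

```python
GAMMA = 0.9
STATES, OBS, ACTS = [0, 1], [0, 1], [0, 1]
# Deterministic toy model: the state never changes and is fully revealed
# by the observation; action a is rewarded only in the matching state.
P = {s: {a: {s2: 1.0 if s2 == s else 0.0 for s2 in STATES}
         for a in ACTS} for s in STATES}
O = {s2: {a: {o: 1.0 if o == s2 else 0.0 for o in OBS}
          for a in ACTS} for s2 in STATES}
R = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}

GRAPH = {"A": (0, {0: "A", 1: "B"}),   # node: (action, observation -> node)
         "B": (1, {0: "A", 1: "B"})}

def evaluate(graph, iters=200):
    """Value of each (node, state) pair under the fixed policy graph."""
    V = {q: {s: 0.0 for s in STATES} for q in graph}
    for _ in range(iters):
        V = {q: {s: R[s][a] + GAMMA * sum(
                     P[s][a][s2] * sum(O[s2][a][o] * V[edges[o]][s2]
                                       for o in OBS)
                     for s2 in STATES)
                 for s in STATES}
             for q, (a, edges) in graph.items()}
    return V

V = evaluate(GRAPH)   # V['A'][0] converges to 10.0 (= 1 / (1 - GAMMA))
```

An improvement step would then re-pick each node's action and edges to maximize this value, holding the rest of the graph fixed, which is what makes the per-iteration cost predictable.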


Column Generation Algorithms for Constrained POMDPs

Walraven, Erwin, Spaan, Matthijs T. J.

Journal of Artificial Intelligence Research

In several real-world domains it is required to plan ahead while there are finite resources available for executing the plan. The limited availability of resources imposes constraints on the plans that can be executed, which need to be taken into account while computing a plan. A Constrained Partially Observable Markov Decision Process (Constrained POMDP) can be used to model resource-constrained planning problems which include uncertainty and partial observability. Constrained POMDPs provide a framework for computing policies which maximize expected reward, while respecting constraints on a secondary objective such as cost or resource consumption. Column generation for linear programming can be used to obtain Constrained POMDP solutions. This method incrementally adds columns to a linear program, in which each column corresponds to a POMDP policy obtained by solving an unconstrained subproblem. Column generation requires solving a potentially large number of POMDPs, as well as exact evaluation of the resulting policies, which is computationally difficult. We propose a method to solve subproblems in a two-stage fashion using approximation algorithms. First, we use a tailored point-based POMDP algorithm to obtain an approximate subproblem solution. Next, we convert this approximate solution into a policy graph, which we can evaluate efficiently. The resulting algorithm is a new approximate method for Constrained POMDPs in single-agent settings, but also in settings in which multiple independent agents share a global constraint. Experiments based on several domains show that our method outperforms the current state of the art.
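The master problem in this scheme can be shown in miniature (the numbers below are toys, not from the paper). Each column i is a policy with expected reward r_i and expected cost c_i, and the master LP picks a mixture maximizing reward subject to a cost budget; with a single constraint, some optimal mixture uses at most two policies, so this tiny master can be solved by enumeration. In the real method, new columns come from POMDP subproblems priced by the master's dual values; here the columns are simply given.

```python
from itertools import combinations

def best_mixture(columns, budget):
    best = (float("-inf"), None)
    for (r1, c1) in columns:                    # pure (single-policy) picks
        if c1 <= budget:
            best = max(best, (r1, ((r1, c1), 1.0)))
    for p1, p2 in combinations(columns, 2):     # two-policy mixtures
        (r1, c1), (r2, c2) = p1, p2
        if c1 == c2:
            continue
        w = (budget - c2) / (c1 - c2)           # weight making cost == budget
        if 0.0 <= w <= 1.0:
            best = max(best, (w * r1 + (1 - w) * r2, (p1, w)))
    return best

columns = [(10.0, 8.0), (3.0, 2.0), (7.0, 5.0)]   # (reward, cost) per policy
value, mix = best_mixture(columns, budget=6.0)    # value is approximately 8.0
```

Randomizing between an expensive high-reward policy and a cheap one, as in the result here, is exactly the kind of mixture the LP relaxation licenses.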


Solving DEC-POMDPs by Expectation Maximization of Value Function

Song, Zhao (Duke University) | Liao, Xuejun (Duke University) | Carin, Lawrence (Duke University)

AAAI Conferences

We present a new algorithm called PIEM to approximately solve for the policy of an infinite-horizon decentralized partially observable Markov decision process (DEC-POMDP). The algorithm uses expectation maximization (EM) only in the step of policy improvement, with policy evaluation achieved by solving the Bellman equation in terms of finite-state controllers (FSCs). This marks a key distinction of PIEM from the previous EM algorithm of Kumar and Zilberstein (2010): PIEM directly operates on a DEC-POMDP without transforming it into a mixture of dynamic Bayes nets. Thus, PIEM precisely maximizes the value function, avoiding complicated forward/backward message passing and the corresponding computational and memory cost. To overcome local optima, we follow Pajarinen and Peltonen (2011) in solving the DEC-POMDP for a finite-length horizon and using the resulting policy graph to initialize the FSCs. We solve the finite-horizon problem using a modified point-based policy generation (PBPG) algorithm, providing a closed-form solution for a subproblem that the original PBPG solved by linear programming. Experimental results on benchmark problems show that the proposed algorithms compare favorably to state-of-the-art methods.


Learning Finite-State Controllers for Partially Observable Environments

Meuleau, Nicolas, Peshkin, Leonid, Kim, Kee-Eung, Kaelbling, Leslie Pack

arXiv.org Artificial Intelligence

Reactive (memoryless) policies are sufficient in completely observable Markov decision processes (MDPs), but some kind of memory is usually necessary for optimal control of a partially observable MDP. Policies with finite memory can be represented as finite-state automata. In this paper, we extend Baird and Moore's VAPS algorithm to the problem of learning general finite-state automata. Because it performs stochastic gradient descent, this algorithm can be shown to converge to a locally optimal finite-state controller. We provide the details of the algorithm and then consider the question of under what conditions stochastic gradient descent will outperform exact gradient descent. We conclude with empirical results comparing the performance of stochastic and exact gradient descent, and showing the ability of our algorithm to extract the useful information contained in the sequence of past observations to compensate for the lack of observability at each time-step.
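The stochastic-gradient idea can be reduced to a few lines, though heavily simplified: the sketch below is plain REINFORCE on a one-node stochastic controller with two actions and one-step episodes, not the VAPS extension itself, which would also keep softmax parameters for node transitions. The environment (action 0 always pays 1, action 1 pays 0) is invented to keep the example tiny.

```python
import math, random

def softmax(theta):
    z = [math.exp(t - max(theta)) for t in theta]
    total = sum(z)
    return [x / total for x in z]

def train(steps=500, lr=0.2, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]                 # action preferences at the one node
    for _ in range(steps):
        p = softmax(theta)
        a = rng.choices([0, 1], weights=p)[0]
        reward = 1.0 if a == 0 else 0.0
        for i in range(2):             # gradient of log pi(a | theta)
            theta[i] += lr * reward * ((1.0 if i == a else 0.0) - p[i])
    return softmax(theta)

p = train()   # p[0] ends up close to 1: the controller learns action 0
```

Because the update is an unbiased sample of the true gradient, the same loop converges (to a local optimum) when rewards and transitions are stochastic, which is the setting the paper analyzes.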


Solving POMDPs by Searching the Space of Finite Policies

Meuleau, Nicolas, Kim, Kee-Eung, Kaelbling, Leslie Pack, Cassandra, Anthony R.

arXiv.org Artificial Intelligence

Solving partially observable Markov decision processes (POMDPs) is highly intractable in general, at least in part because the optimal policy may be infinitely large. In this paper, we explore the problem of finding the optimal policy from a restricted set of policies, represented as finite state automata of a given size. This problem is also intractable, but we show that the complexity can be greatly reduced when the POMDP and/or policy are further constrained. We demonstrate good empirical results with a branch-and-bound method for finding globally optimal deterministic policies, and a gradient-ascent method for finding locally optimal stochastic policies.